mwetoolkit+sem: Integrating Word Embeddings in the mwetoolkit for Semantic MWE Processing

Authors

  • Silvio Cordeiro
  • Carlos Ramisch
  • Aline Villavicencio
Abstract

This paper presents mwetoolkit+sem: an extension of the mwetoolkit that estimates semantic compositionality scores for multiword expressions (MWEs) based on word embeddings. First, we describe our implementation of vector-space operations working on distributional vectors. The compositionality score is based on the cosine distance between the MWE vector and the composition of the vectors of its member words. Our generic system can handle several types of word embeddings and MWE lists, and may combine individual word representations using several composition techniques. We evaluate our implementation on a dataset of 1042 English noun compounds (Farahmand et al., 2015), comparing different configurations of the underlying word embeddings and word-composition models. We show that our vector-based scores model non-compositionality better than standard association measures such as log-likelihood.
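The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the mwetoolkit+sem implementation or its API: the function names are hypothetical, the embeddings are made-up toy vectors, and additive composition is assumed here as just one possible composition technique. The sketch returns cosine similarity (higher means more compositional); the corresponding cosine distance would be one minus this value.

```python
import math


def add_vectors(vectors):
    """Additive composition (an assumption): element-wise sum of member word vectors."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compositionality_score(mwe_vector, member_vectors):
    """Compare the vector of the whole MWE with the composition of its members."""
    composed = add_vectors(member_vectors)
    return cosine_similarity(mwe_vector, composed)


# Toy 3-dimensional embeddings with invented values, for illustration only.
# In practice the MWE would be treated as a single token when training embeddings.
embeddings = {
    "olive": [0.9, 0.1, 0.0],
    "oil": [0.1, 0.9, 0.0],
    "olive_oil": [0.5, 0.5, 0.0],
    "hot": [0.8, 0.0, 0.2],
    "dog": [0.0, 0.8, 0.2],
    "hot_dog": [0.0, 0.1, 0.9],
}

score_olive_oil = compositionality_score(
    embeddings["olive_oil"], [embeddings["olive"], embeddings["oil"]]
)
score_hot_dog = compositionality_score(
    embeddings["hot_dog"], [embeddings["hot"], embeddings["dog"]]
)

print("olive oil:", round(score_olive_oil, 3))
print("hot dog:", round(score_hot_dog, 3))
```

With these toy vectors, the compound whose vector lies close to the sum of its parts receives the higher score, while the idiomatic compound, whose vector points elsewhere, receives a lower one.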

Similar resources

mwetoolkit: a Framework for Multiword Expression Identification

This paper presents the Multiword Expression Toolkit (mwetoolkit), an environment for type and language-independent MWE identification from corpora. The mwetoolkit provides a targeted list of MWE candidates, extracted and filtered according to a number of user-defined criteria and a set of standard statistical association measures. For generating corpus counts, the toolkit provides both a corpu...

Extraction of Nominal Multiword Expressions in French

Multiword expressions (MWEs) can be extracted automatically from large corpora using association measures, and tools like mwetoolkit allow researchers to generate training data for MWE extraction given a tagged corpus and a lexicon. We use mwetoolkit on a sample of the French Europarl corpus together with the French lexicon Dela, and use Weka to train classifiers for MWE extraction on the gener...

Multiword Expressions in the wild? The mwetoolkit comes in handy

The mwetoolkit is a tool for automatic extraction of Multiword Expressions (MWEs) from monolingual corpora. It both generates and validates MWE candidates. The generation is based on surface forms, while for the validation, a series of criteria for removing noise are provided, such as some (language independent) association measures. In this paper, we present the use of the mwetoolkit in a sta...

UFRGS&LIF at SemEval-2016 Task 10: Rule-Based MWE Identification and Predominant-Supersense Tagging

This paper presents our approach towards the SemEval-2016 Task 10 – Detecting Minimal Semantic Units and their Meanings. Systems are expected to provide a representation of lexical semantics by (1) segmenting tokens into words and multiword units and (2) providing a supersense tag for segments that function as nouns or verbs. Our pipeline rule-based system uses no external resources and was imp...

Automatic extraction and evaluation of MWE

This short paper aims at presenting a method for automatically extracting and evaluating MWE in the Europarl corpus. For this purpose we make use of mwetoolkit and utilize its output to find rules for the automatic evaluation of MWE. We then developed an XML parser to evaluate MWE candidates against those rules and also against online dictionaries. A sample of the results was manually evaluated...

Publication date: 2016